Marginal Checking as an Aid to Computer Reliability



NORMAN H. TAYLOR, SENIOR MEMBER, IRE



Summary: Deteriorating components, particularly crystals and
vacuum tubes, cause reduction of safety margins and are a principal
source of error in digital computing and pulse communication.

Marginal checking varies voltages to logical circuit groups,
inducing inferior parts to cause failure, while a test program or pulse
transmission detects and localizes potential failure. In a digital
computer, this can be automatically accomplished with the computer
itself acting as the detector.

In one trial on a 400-tube prototype system the application of
this type of preventive maintenance for half an hour per day
improved reliability 50 to 1. Results of preliminary tests on a full
computer are discussed.

I. INTRODUCTION

ELECTRONIC digital computers will be used to
solve real-time problems and must be reliable.
For example, when the modern computer becomes
the nerve center of an all-weather air traffic control
system, the plane pilot must know the system is
operating, and will continue to operate, without error.
Such reliability can be guaranteed only by detecting
imminent failures and preventing their occurrence.

In order to obtain "computer reliability," a much
higher degree of performance is required than in
ordinary means of communication. The basic difference is
the high concentration of information used in a
computer compared with the concentration of information
in speech, television, or radar. Interruptions in circuits
of the latter type can occur at frequent intervals, with
little loss of intelligence. An occasional intermittent
tube does not void the sense from a radio, ignition noise
does not completely void television, nor does an arcing
magnetron void the plot on a radar screen.

This is not good enough in computer applications.
The usual method of transmitting intelligence
in a computer is to supply high-frequency pulses to
particular circuits at specified times. A single pulse
occurring at the wrong time can invalidate the usefulness
of the whole effort. This single-error limitation is due
to the presence of a memory in a computer. Memory
remembers the errors as well as the information to be
processed, and once an error becomes imbedded in the
memory it can be propagated into all subsequent
calculation.

The necessary reliability can be approached by
combining good design with the best available components,
and utilizing marginal checking as an additional aid.

Marginal checking differs from ordinary checking by
not only answering the question, "Are all circuits
functioning?" but also, "How much longer will the circuits
function?" Good equipment starts with wide safety
margins, but age and wear reduce these safety margins,
leading to eventual failure. Marginal checking assures
adequate safety by testing the system frequently enough so
that only slight deterioration can occur between tests.



II. The Marginal Checking System 
A. Magnitude of the Problem 

Most of the large-scale digital machines under
development utilize many thousands of vacuum tubes,
crystals, resistors, condensers, and coils. The vacuum
tube is the least reliable component of this group, and
the crystal rectifier, though better than the tube, is still
a weak link in the chain of reliability. Failures in the
resistors, condensers, and coils are not frequent, and
these elements do not threaten computer reliability to
such an extent.

What may be expected of a system using present-day
vacuum tubes and crystals? A few assumptions will
serve to indicate the problem. If a typical computer
has 5,000 cathodes and 10,000 crystals, suppose the
tubes will last on an average of 5,000 hours, and the
crystals, 10,000 hours. Every 30 minutes one of these
aging components may cause a failure. Furthermore,
some of these failures will not be steady but will cause
marginal operation and thus be very difficult to locate.
In a typical 8-hour day this may cause 16 shutdowns.
Even if a trouble-location technique is well developed,
so that the period of shutdown is short, the efficiency of
the machine will be very low. One might ask if a periodic
replacement program could be followed which would
eliminate many of these component failures.
Unfortunately, early failure in groups of new tubes is quite high,
so that wholesale replacement on simply a time basis
would increase the failure rate.
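
The 30-minute and 16-shutdown figures follow from the stated assumptions by simple arithmetic. The sketch below reproduces them in Python. It assumes, as a simplification not stated in the paper, that each component fails at a constant rate equal to the reciprocal of its average life; the function and variable names are illustrative only.

```python
def failures_per_hour(count: int, mean_life_hours: float) -> float:
    """Expected failures per hour for `count` parts, assuming each part fails
    at a constant rate of 1 / mean_life_hours (a simplifying assumption)."""
    return count / mean_life_hours


# Figures taken from the text: 5,000 cathodes lasting 5,000 hours on average,
# and 10,000 crystals lasting 10,000 hours on average.
tube_rate = failures_per_hour(5_000, 5_000.0)
crystal_rate = failures_per_hour(10_000, 10_000.0)
total_rate = tube_rate + crystal_rate

minutes_between_failures = 60.0 / total_rate
shutdowns_per_day = total_rate * 8  # 8-hour working day

print(minutes_between_failures)  # 30.0
print(shutdowns_per_day)         # 16.0
```

Under this model the failure rate scales directly with the number of components, which is why a machine with many thousands of tubes and crystals cannot rely on waiting for failures to appear.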

B. Features of Marginal Checking 

The preventive maintenance techniques called marginal
checking use performance margins to establish life
expectancy of components, so that those with low margins
can be removed during a testing period.

Three features of this marginal checking scheme make
it very practical for use in large electronic systems:

(1) The checking system can detect imminent fail- 
ures before they become real failures and cause com- 
putational error. 

(2) This detection can isolate the failing com- 
ponent to a specific tube, crystal, or resistor. 

(3) Such isolation can be so rapid that it consumes
only a small percentage of total machine time.

1. Conversion to Real Failures: The conversion of
imminent failures to real failures during test periods is the
important key in this marginal checking system. Such
checking is possible in computers and also in many other
pulse systems due to the on-off nature of the circuitry
used.

In a computer, information passes from one place to
another as the presence or absence of a pulse on a
transmission line. It is not necessary that the pulse be of
any particular amplitude to get this information to its
destination but only that the pulse be large enough to
affect the detector. If the presence of a pulse means a 1
and the absence a 0, then a pulse which is too small to
affect the detector has the same effect as no pulse at all
and so a 0 is recorded.
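
A minimal sketch of this detector behavior, in Python, may make the point concrete. The threshold and the pulse amplitudes are illustrative assumptions in arbitrary units, not figures from the paper.

```python
def detect(pulse_amplitude: float, threshold: float) -> int:
    """Return 1 if the pulse is large enough to affect the detector, else 0."""
    return 1 if pulse_amplitude >= threshold else 0


THRESHOLD = 10.0  # hypothetical detector threshold, arbitrary units

print(detect(25.0, THRESHOLD))  # healthy pulse: recorded as 1
print(detect(12.0, THRESHOLD))  # weaker pulse, still adequate: recorded as 1
print(detect(6.0, THRESHOLD))   # pulse too small: recorded as 0
print(detect(0.0, THRESHOLD))   # no pulse at all: recorded as 0
```

The second and third cases are the ones of interest: a component can lose much of its output before any error appears, and marginal checking measures how much of that reserve is left.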

a. A Simple Computer Channel 

Fig. 1 gives a typical basic block diagram often
encountered in pulse systems. Gate tube A, when open,
allows pulses to pass along a channel to a flip-flop. If the
pulses are large enough and the flip-flop in proper
condition, each pulse will cause a reversal of the flip-flop
from a 1 to a 0 or vice versa.

Fig. 1. A typical computer channel.

Two sorts of trouble may develop. First, the gate
tube may deteriorate and cause the pulse amplitude to
be reduced to a point where the flip-flop will not switch
or, second, the flip-flop may refuse to switch because
one of its components has deteriorated.

b. Checking the Gate Circuit 

The margin of performance in the gate tube (A) can
be checked by lowering the voltage on the screen of the
tube by inserting a negative voltage in series with the
screen lead as shown in Fig. 2 (a schematic for gate
tube circuit). The pulses emerging from the tube will be
lower than they were before the deviation.

If both the flip-flop and gate as shown in Fig. 1 have
adequate margins then this marginal checking of the
gate circuit will make no difference. This can be
detected by another gate tube (B) which opens and closes
according to the action of the flip-flop. If a sensing pulse
is applied to gate tube B in Fig. 1, it will pass through
to indicate that the flip-flop has switched and opened
the channel. In the diagram shown this should occur
for every other pulse passing through gate tube A.

A low margin in gate tube A will interrupt this
sequence and no check pulse will emerge from gate tube
B. From such a test it can be determined whether or not
the gate circuit is nearing an unsafe condition. The
circuit shown in Fig. 2 has a nominal screen voltage of
90 volts. A typical margin would be minus 20 volts from
this value.

Fig. 2. Marginal checking of gate circuit.
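
The gate check can be sketched as a small simulation. The sketch below is only an illustration under stated assumptions: the linear relation between screen voltage and pulse amplitude, the `tube_gain` figure of merit, and the trigger level are all hypothetical and are not taken from the paper. Only the 90-volt nominal value and the minus-20-volt excursion come from the text.

```python
NOMINAL_SCREEN_V = 90.0  # nominal screen voltage given in the text
MARGIN_V = -20.0         # typical margin excursion given in the text
TRIGGER_LEVEL = 30.0     # hypothetical amplitude needed to switch the flip-flop


def pulse_amplitude(tube_gain: float, screen_v: float) -> float:
    """Hypothetical linear model: output pulse scales with screen voltage."""
    return tube_gain * screen_v


def run_channel(tube_gain: float, screen_v: float, n_pulses: int = 8) -> list[bool]:
    """Send pulses through gate A. After each one, record whether a sensing
    pulse would pass gate B (True when the flip-flop holds a 1)."""
    state = 0
    sensed = []
    for _ in range(n_pulses):
        if pulse_amplitude(tube_gain, screen_v) >= TRIGGER_LEVEL:
            state ^= 1  # an adequate pulse reverses the flip-flop
        sensed.append(state == 1)
    return sensed


def gate_has_margin(tube_gain: float, n_pulses: int = 8) -> bool:
    """True if the check pulse still emerges for every other input pulse
    while the screen voltage is lowered by the margin excursion."""
    expected = [i % 2 == 0 for i in range(n_pulses)]  # pass, block, pass, ...
    return run_channel(tube_gain, NOMINAL_SCREEN_V + MARGIN_V, n_pulses) == expected


print(gate_has_margin(0.5))  # healthy tube: True
print(gate_has_margin(0.4))  # aging tube, still works at 90 volts: False
```

With these assumed numbers the tube with gain 0.4 operates correctly at the nominal voltage but fails once the screen voltage is lowered, which is exactly the imminent failure the test is meant to expose.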

c. Checking the Flip-Flop 

This first check assumed that the flip-flop was
performing normally and acting as a detector for the
arrival of pulses. To check this assumption the following
test can be made on the flip-flop circuit.

Fig. 3 is a simplified schematic of a flip-flop. One tube
must have the ability, when conducting, to hold the
other tube in a nonconducting state. The circuit is
completely symmetrical. Tube deterioration shows up as a
reduction in plate current in one tube with a consequent
reduction of bias available to the opposite tube. The
large cathode resistor allows considerable aging before
the condition becomes intolerable but eventually tube
deterioration will become so extreme that instability
will occur and the flip-flop will favor one side. Then,
whenever it is ordered to change sides by an incoming
pulse the circuit will either fail to switch or fail to hold
its new position after switching takes place.

Fig. 3. Marginal checking of flip-flop circuit.

This unfavorable condition can be detected before it
leads to failure by feeding the two screen circuits of the
flip-flop separately, as shown in Fig. 3, and selectively
raising the screen voltage of the normally off tube about
30 volts (nominal value 120 volts). Raising its screen
voltage also raises its number 1 grid cutoff voltage. The
normally on tube must have a safe margin of plate
current available if it is able to hold the tube being checked
off under these extreme conditions. If the on tube is
weak it will fail to hold off the opposite tube and a
spurious switching operation will result. The detection of this
condition can be automatic by using the sensing pulses
and gate circuits shown in Fig. 1.

d. Testing Crystals in a Clamp Circuit

A third type of conversion which will pick up aging
crystals is of considerable interest. Fig. 4 shows a
clamping circuit which couples the plate of a flip-flop to
a gate tube. Proper operation of this circuit depends
on the back resistance of the crystal staying at a high
value so that proper clamping action will be available
during the period between the voltage pedestals used
for clamping. If the crystal deteriorates, the voltage at
the grid of the gate circuit will appear as shown at the
right of the diagram. Serious deterioration will result
in the opening of the gate circuit when it should be
closed.

Fig. 4. Marginal checking of clamp crystal rectifiers.

To convert this imminent failure to a real one, a
change in the timing of the clamping period is used. A
good crystal will operate when a much longer period is
allowed, but a deteriorating unit will not hold the bias
that long and a failure will result. Values of 16
microseconds and 64 microseconds have been used
effectively in this circuit. If a sensing pulse to the gate tube
under control of this clamp circuit is inserted near the
end of this longer wait period it will be rejected by a
good crystal and passed by a deteriorating one. This
scheme can then be automatized.
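
The timing test can be illustrated with a simple decay model. The sketch below assumes that the clamp bias leaks away exponentially through the crystal's back resistance and a fixed capacitance; this model, the component values, and the hold level are hypothetical choices made for illustration and do not come from the paper. Only the 16- and 64-microsecond periods are taken from the text.

```python
import math

NORMAL_WAIT_S = 16e-6  # shorter clamping period mentioned in the text
TEST_WAIT_S = 64e-6    # longer period used for marginal checking


def bias_remaining(v0: float, back_resistance_ohms: float,
                   capacitance_f: float, wait_s: float) -> float:
    """Bias left after `wait_s`, assuming simple exponential leakage
    through the crystal back resistance (a hypothetical model)."""
    return v0 * math.exp(-wait_s / (back_resistance_ohms * capacitance_f))


def gate_passes_sensing_pulse(back_resistance_ohms: float, wait_s: float,
                              v0: float = 30.0,
                              capacitance_f: float = 100e-12,
                              hold_level: float = 15.0) -> bool:
    """The gate opens, and so wrongly passes the sensing pulse, once the
    remaining bias falls below the level needed to hold it closed."""
    remaining = bias_remaining(v0, back_resistance_ohms, capacitance_f, wait_s)
    return remaining < hold_level


good_crystal = 5e6    # ohms, hypothetical healthy back resistance
aging_crystal = 5e5   # ohms, hypothetical back resistance after drifting lower

print(gate_passes_sensing_pulse(good_crystal, TEST_WAIT_S))     # False: rejected
print(gate_passes_sensing_pulse(aging_crystal, NORMAL_WAIT_S))  # False: still works
print(gate_passes_sensing_pulse(aging_crystal, TEST_WAIT_S))    # True: fault exposed
```

In this sketch the aging crystal behaves correctly at the shorter period and fails only when the wait is stretched, so lengthening the period plays the same role for crystals that voltage variation plays for tubes.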

2. Localizing Failures: Once an imminent failure has
been converted to a real failure by any one of the
methods noted above, the problem of detecting the fault and
localizing it to a particular source can be very
time-consuming if it is not approached in an orderly manner.
Fault isolation can be solved if the computer is divided
for marginal checking into small logical sections. To
simplify the trouble-location scheme, sections should be
chosen so that at a given time only one fault can exist.

The logical design of a computer separates it into
many channels, all starting at the pulse source and
dispersing throughout the system to a destination.



Fig. 5 shows two of these typical channels separated
into four sections. The vertical lines indicate how these
channels may be broken for purposes of marginal
checking and isolation of faults. In each case a pulse starts
from the distributor along its channel and arrives at its
destination with enough energy to change the
condition of a flip-flop circuit in the destination section. If
each section is subjected to voltage variation and the
sequence still functions, the channel can be said to have
adequate margins.

Fig. 5. Computer marginal checking.

The addition of a checking section to these channels
allows the checking routine to be carried out
automatically by the computer. An error-sensing pulse checks
that the information arriving at the checking section via
the channel under test is the same as that arriving by a
separate checking channel. If the two pieces of
information disagree, an alarm is sounded and immediately
the pulse distributor is stopped.

Knowing the stopping point of the distributor isolates
the channel at fault. In addition, knowledge of
the section under voltage variation isolates the tube in
the channel. The operator can usually find such troubles
in a few minutes during such a test routine.

These channels are not used simultaneously but in a
time sequence, so tubes of the same type, but in different
channels, may be grouped in the same section for
voltage variation and no loss in isolation results.
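
The localization logic described above can be sketched as two nested loops: one section at a time is placed under voltage variation, and the distributor steps through the channels in time sequence. This is a simplified illustration, not the Whirlwind implementation. The channel and section labels, and the `weak` set that stands in for the hidden low-margin components, are hypothetical.

```python
from typing import Hashable, Iterable, Optional


def locate_fault(channels: Iterable[Hashable],
                 sections: Iterable[Hashable],
                 weak: set) -> Optional[tuple]:
    """Return (channel, section) for the first low-margin component found,
    or None if every channel keeps adequate margins in every section.

    `weak` holds (channel, section) pairs whose component fails when that
    section is under voltage variation (hypothetical ground truth)."""
    channel_list = list(channels)
    for section in sections:              # one section varied at a time
        for channel in channel_list:      # distributor steps channel by channel
            expected = 1                  # value from the separate checking channel
            received = 0 if (channel, section) in weak else 1
            if received != expected:      # disagreement: alarm, distributor stops
                # The stopping point gives the channel; the section under
                # variation gives the component within that channel.
                return channel, section
    return None


print(locate_fault(["A", "B"], [1, 2, 3, 4], {("B", 3)}))  # ('B', 3)
print(locate_fault(["A", "B"], [1, 2, 3, 4], set()))       # None
```

Because only one channel is active at any instant, the pair (stopping point, section under variation) identifies a single component even when a section groups tubes from several channels.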

3. Automatic Marginal Checking: The whole sequence
of sending pulses through each of the channels has been
automatized in the Whirlwind Computer system
sponsored by the Office of Naval Research at the
Massachusetts Institute of Technology.¹ Some 200 sections are
used. The computer program sends the pulses through
each of the channels in a fraction of a second. The
sections are selected by telephone switching apparatus
and subjected to voltage variation at 5-second intervals.
In this way the whole system can be completely
checked in about 15 minutes.

¹ The Whirlwind I computer is an electronic digital machine
capable of performing at very high speed; i.e., [illegible]
multiplications per second.

At present it appears that establishment of adequate
margins once each day will be an excellent guarantee
that the next 24-hour period will be completely free from
error.

It is evident that the basic principles of marginal
checking discussed in this paper are simple; but the
system must be carefully designed to reap advantages of
the checking in an economical way. Too many checking
circuits complicate the equipment; not enough will fail
to give unique indications and will not isolate defective
components.

III. CONCLUSION 

The most significant information about marginal
checking is its performance record. Over a period of
eight months, a 5-binary-digit prototype arithmetic
element at MIT has been running a test problem over
and over 24 hours a day. This test system contains about
400 vacuum tubes and 1,000 crystals, and marginal
checking is done manually for a period of 1/2 hour a day
and deteriorating components are removed. This
equipment has made several runs of three weeks without
computational error, which represents 2.5 × 10^[exponent
illegible] correct solutions of the problem, and about
10^11 correct flip-flop reversals in 25 flip-flop circuits.
The average run without error has been eleven days,
which represents approximately a 50-to-1 improvement
in the results obtained before marginal checking was
installed. A run of forty-five days without error was made
in early 1950. During this forty-five-day period, 12 tubes,
7 crystals, and 4 resistors were located during marginal
checking periods and replaced because of low margins.
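
Two figures implied by this record can be worked out directly. The short calculation below is a reader's aid rather than a result quoted in the paper: it derives the average error-free run before marginal checking from the eleven-day average and the 50-to-1 ratio, and the average spacing of the replacements made during the forty-five-day run. The variable names are illustrative.

```python
avg_run_days_with_checking = 11  # average error-free run, from the text
improvement_ratio = 50           # approximate improvement, from the text

# Implied average error-free run before marginal checking was installed.
avg_run_hours_before = avg_run_days_with_checking * 24 / improvement_ratio

# Components replaced for low margins during the forty-five-day run.
replaced = 12 + 7 + 4            # tubes + crystals + resistors
days_per_replacement = 45 / replaced

print(round(avg_run_hours_before, 2))  # 5.28
print(replaced)                        # 23
print(round(days_per_replacement, 2))  # 1.96
```

On these figures a weak component was caught roughly every two days, each one a likely error had it been left in service.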

When one begins to work with larger systems, there is
reason to believe that, with marginal checking, errors
will not increase in proportion to the extra equipment
involved. A high percentage of the remaining errors are
caused by power failure, lightning, and external
disturbances independent of the number of vacuum tubes
in the system.

A measure of the success of marginal checking in im- 
proving the performance of the Whirlwind Computer is 
shown in Table I. 

TABLE I
Tube and Crystal Failures*

                                              Tubes        Crystals
Number in use                                 3,[?]00      11,000
Total failures                                187          272
Obvious faults                                [illegible]  [illegible]
Deterioration of operating characteristics    [illegible]  [illegible]
Failures located by marginal checking         109          223

* Note: Majority of tubes and crystals were in service for
3,390 hours.

At present, 3,[?]00 tubes and 11,000 crystals have been
running for about 3,300 hours; [illegible] registers of test
storage, made up of toggle switches and flip-flops, allow
the solution of several problems which thoroughly test
the computer.

During these installation tests, 187 tubes have been
removed, 109 of which have been located by marginal
checking techniques. The majority of tube failures with
deteriorating characteristics have been due to the
formation of an apparent resistance on the cathode sleeve or in
the cathode coating. This defect has been called
interface resistance.

Obvious tube faults have been due to gas, broken pins,
internal short circuits, and open welds. Many of these
have been located by the built-in checking system of
the computer without the aid of marginal checking.

Of the 272 crystal failures, 223 were located by the
marginal-checking technique. The most serious fault has
been a drifting of back resistance to a lower value by a
factor of 2 to 10 with the continued application of
voltage. The cause of this is not well understood but 1
to 10 per cent of new crystals exhibit this tendency after
voltage has been applied for a period of 30 to 60
seconds. A few obvious faults have been due to completely
open or short-circuited crystals.
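
The share of failures caught by marginal checking follows from the counts given in the text. The calculation below is a reader's aid; the percentages are derived here and are not quoted in the paper, and the function name is illustrative.

```python
def percent_located(located: int, total: int) -> float:
    """Percentage of failures found by marginal checking, to one decimal."""
    return round(100.0 * located / total, 1)


tubes = percent_located(109, 187)                 # tubes removed, from the text
crystals = percent_located(223, 272)              # crystal failures, from the text
combined = percent_located(109 + 223, 187 + 272)  # both component types together

print(tubes)     # 58.3
print(crystals)  # 82.0
print(combined)  # 72.3
```

Crystals show the higher share, which is consistent with their dominant fault being a gradual drift in back resistance rather than a sudden break.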

About a dozen tubes and a few crystals have been
intermittent. The on-off intermittent is the most
difficult fault to locate in electronic circuits. Marginal
checking does not aid in isolating this type of failure and this
represents one limitation in the system. Complete failure
such as filament burnout also cannot be predicted.
However, in 3,300 hours of operation, only two tubes
have exhibited such failure.

Some of the by-products of marginal checking have 
proved invaluable in testing the Whirlwind system. 
Many low performance margins have been found which 
were due to design weaknesses and not to deteriorating 
components. 

Refinements have been made in the design to reduce 
noise level and improve timing of pulse sequences and 
frequency response. These improvements have all been 
possible earlier in the program than usual, due in a large 
measure to marginal checking. 